In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option("display.max_columns", 100)
pd.set_option("display.precision", 4)
plt.rcParams["figure.figsize"] = (15, 6)
%matplotlib inline
In [2]:
gdp = pd.read_csv("/data/gdp.csv")
gdp.info()
In [3]:
gdp.head()
Out[3]:
In [4]:
gdp = gdp.set_index(pd.to_datetime(gdp.Year, format="%Y"))
gdp.head()
Out[4]:
In [5]:
gdp.iso2c.unique()
Out[5]:
In [6]:
plt.figure(figsize = (15, 8))
gdp.groupby(["Country"]).GDP.plot(legend = True);
In [7]:
usa = gdp.query("iso2c == 'US'").PerCapGDP
usa.head()
Out[7]:
In [8]:
usa.plot()
Out[8]:
In [9]:
usa_rolling_mean = usa.rolling(window = 10, center = False).mean()
usa_rolling_std = usa.rolling(window = 10, center = False).std()
plt.plot(usa, label = "actual")
plt.plot(usa_rolling_mean, label = "moving avg")
plt.plot(usa_rolling_std, label = "Std")
plt.legend()
Out[9]:
In [10]:
usa_log = np.log(usa)
usa_rolling_mean = usa_log.rolling(window = 10, center = False).mean()
usa_rolling_std = usa_log.rolling(window = 10, center = False).std()
plt.plot(usa_log)
plt.plot(usa_rolling_mean)
plt.plot(usa_rolling_std)
Out[10]:
In [11]:
from statsmodels.tsa.stattools import adfuller, acf, pacf
In [12]:
# Augmented Dickey-Fuller test: print the test statistic, p-value, lag count, and critical values.
def test_stationary(tseries):
    dftest = adfuller(tseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic', 'p-value',
                                '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

test_stationary(usa)
Because the p-value is larger than 0.05, the null hypothesis of the Dickey-Fuller test cannot be rejected, and the moving average is not constant over time. This shows that the per-capita GDP time series is not stationary. Before you can apply ARIMA models for forecasting, you need to transform this time series into a stationary time series.
In [13]:
test_stationary(usa_log)
Because the p-value is now smaller than 0.05, we have the evidence needed to reject the null hypothesis of the Dickey-Fuller test. This shows that the time series is stationary at log scale. We can apply ARIMA model forecasting to this log-scaled data.
In [14]:
from statsmodels.tsa.seasonal import seasonal_decompose
In [15]:
decomposition = seasonal_decompose(usa)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid
plt.subplot(411)
plt.plot(usa, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
Now that you have stationarized your time series, you could go on to model the residuals (fit lines between values in the plot). However, because the trend and seasonality components extracted by the decomposition are not consistent and cannot be scaled back to the original values, you cannot use this approach to create reliable forecasts. If your time series does show a strong and consistent seasonal pattern, the SARIMAX model in Python is a better choice; a minimal, purely illustrative sketch follows.
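The sketch below is not derived from this data: the seasonal period s and the seasonal order are placeholders (s=12 would suit monthly data; the annual GDP series has no such cycle).
In [ ]:
# Purely illustrative: SARIMAX adds a seasonal (P, D, Q, s) order on top of ARIMA.
# The orders below are placeholders, not values derived from this GDP series.
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarimax_model = SARIMAX(usa,
                        order=(2, 1, 0),               # non-seasonal (p, d, q)
                        seasonal_order=(1, 0, 0, 12))  # seasonal (P, D, Q, s); e.g. s=12 for monthly data
sarimax_results = sarimax_model.fit(disp=False)
print(sarimax_results.summary())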
In [16]:
usa_diff = usa - usa.shift(periods=1)
usa_diff.dropna(inplace=True)
plt.plot(usa_diff)
test_stationary(usa_diff)
To apply an ARIMA model to your time series, you need to find optimal values for the following three model parameters (p, d, q): p is the number of autoregressive (AR) terms, d is the number of times the series must be differenced to become stationary, and q is the number of moving average (MA) terms.
There are two ways to determine the number of AR and MA terms. The first is to use the arma_order_select_ic function in Python. The second uses plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF). This article describes in detail the purpose of the ACF and PACF plots.
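A minimal sketch of the first approach, applied to the differenced series usa_diff from the cell above (the max_ar and max_ma search bounds are arbitrary choices for illustration):
In [ ]:
# Search a small grid of ARMA orders and report the (p, q) pair with the lowest AIC.
from statsmodels.tsa.stattools import arma_order_select_ic

order_search = arma_order_select_ic(usa_diff, max_ar=4, max_ma=2, ic='aic')
print(order_search.aic_min_order)  # (p, q) pair with the lowest AIC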
In [17]:
# ACF and PACF plots
plt.figure(figsize=(15, 6))

def plot_acf_pacf(ts):
    lag_acf = acf(ts, nlags=10)
    lag_pacf = pacf(ts, nlags=10, method='ols')
    # Plot ACF with approximate 95% confidence bounds (+/- 1.96 / sqrt(n)):
    plt.subplot(121)
    plt.plot(lag_acf)
    plt.axhline(y=0, linestyle='--', color='gray')
    plt.axhline(y=-1.96/np.sqrt(len(ts)), linestyle='--', color='gray')
    plt.axhline(y=1.96/np.sqrt(len(ts)), linestyle='--', color='gray')
    plt.title('Autocorrelation Function')
    # Plot PACF with the same confidence bounds:
    plt.subplot(122)
    plt.plot(lag_pacf)
    plt.axhline(y=0, linestyle='--', color='gray')
    plt.axhline(y=-1.96/np.sqrt(len(ts)), linestyle='--', color='gray')
    plt.axhline(y=1.96/np.sqrt(len(ts)), linestyle='--', color='gray')
    plt.title('Partial Autocorrelation Function')
    plt.tight_layout()

usa_diff = usa - usa.shift(periods=1)
usa_diff.dropna(inplace=True)
plot_acf_pacf(usa_diff)
In this plot, the 'p' and 'q' values can be determined as follows: 'p' is the lag at which the PACF crosses the upper confidence bound for the first time, and 'q' is the lag at which the ACF crosses the upper confidence bound for the first time; 'd' is the number of times the series was differenced to make it stationary (here, once).
This means that the optimal values for the ARIMA(p,d,q) model are (2,1,0).
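As an alternative to the manual plotting function above, statsmodels ships plotting helpers that draw the confidence bands for you; a minimal sketch using the same differenced series:
In [ ]:
# Built-in ACF/PACF plots with confidence intervals drawn automatically.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
plot_acf(usa_diff, lags=10, ax=axes[0])
plot_pacf(usa_diff, lags=10, ax=axes[1])
plt.tight_layout()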
If your assessment of the ACF and PACF plots differs from the values suggested by the arma_order_select_ic function, plug in different values for the p and q terms, study the AIC values of the resulting fits, and proceed with the model that has the lower AIC value.
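A minimal sketch of that comparison, looping over a small, arbitrarily chosen grid of (p, q) candidates and keeping the fit with the lowest AIC:
In [ ]:
# Compare candidate orders by AIC; lower AIC is preferred.
from statsmodels.tsa.arima_model import ARIMA

best_aic, best_order = np.inf, None
for p in range(3):
    for q in range(3):
        try:
            candidate = ARIMA(usa, order=(p, 1, q)).fit(disp=-1)
        except (ValueError, np.linalg.LinAlgError):
            continue  # some orders may fail to converge; skip them
        if candidate.aic < best_aic:
            best_aic, best_order = candidate.aic, (p, 1, q)
print(best_order, best_aic)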
Run the next code cell to plot the ARIMA model using the values (2,1,0):
In [18]:
from statsmodels.tsa.arima_model import ARIMA
In [19]:
model = ARIMA(usa, order=(2, 1, 0))
results_ARIMA = model.fit(disp=-1)
plt.plot(usa_diff, label = "Original")
plt.plot(results_ARIMA.fittedvalues, color='red', label = "fitted")
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues - usa_diff)**2))
plt.legend()
Out[19]:
You can measure how well the model fits the underlying data with the residual sum of squares (RSS) metric: a small RSS indicates that the model fits the data tightly. Another way to validate the appropriateness of the ARIMA model is residual analysis. Print the results of the ARIMA model and plot the residuals. A density plot of the residual error values indicates a roughly normal distribution centered on a zero mean. Also, the residuals do not violate the assumptions of constant location and scale, with most values in the range (-1, 1).
In [20]:
print(results_ARIMA.summary())
In [21]:
residuals = pd.DataFrame(results_ARIMA.resid)
residuals.plot(kind='kde')
print(residuals.describe())
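As an extra diagnostic not in the original notebook, a Ljung-Box test can check whether the residuals still contain significant autocorrelation (a large p-value is consistent with an adequate fit):
In [ ]:
# Ljung-Box test on the residuals; the null hypothesis is "no remaining autocorrelation".
from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(results_ARIMA.resid, lags=[10]))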
Now that the model is returning the results you want to see, you can scale the model predictions back to the original scale. For this, you will reverse the first-order differencing by taking the cumulative sum of the fitted values and adding it to the first observed value. The lower the root mean square error (RMSE), and the closer it is to 0, the closer the model's predictions are to the actual values.
In [22]:
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.head()
Out[22]:
In [23]:
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
predictions_ARIMA = pd.Series(usa.iloc[0], index=usa.index)
predictions_ARIMA = predictions_ARIMA.add(predictions_ARIMA_diff_cumsum, fill_value=0)
In [24]:
plt.plot(usa, label = "Actual")
plt.plot(predictions_ARIMA, label = "ARIMA fitted")
plt.legend()
Out[24]:
In [25]:
from sklearn.metrics import mean_squared_error
In [26]:
# Walk-forward validation: refit the model at each step and forecast one period ahead.
size = int(len(usa) - 5)
train, test = usa[0:size], usa[size:len(usa)]
history = [x for x in train]
predictions = list()
print('Printing Predicted vs Expected Values...')
print('\n')
for t in range(len(test)):
    model = ARIMA(history, order=(2, 1, 0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(float(yhat))
    obs = test.iloc[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
error = mean_squared_error(test, predictions)
print('\n')
print('Printing Mean Squared Error of Predictions...')
print('Test MSE: %.6f' % error)
predictions_series = pd.Series(predictions, index=test.index)
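To report the RMSE mentioned earlier rather than the raw MSE, take the square root of the test error:
In [ ]:
# RMSE is the square root of the MSE computed above, in the same units as Per Capita GDP.
print('Test RMSE: %.6f' % np.sqrt(error))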
In [33]:
#rolling one-step out-of-sample
fig, ax = plt.subplots()
ax.set(title='GDP Per Capita forecasting', xlabel='Date', ylabel='Per Capita GDP')
ax.plot(usa, marker = '.', color = "red", label='observed')
ax.plot(predictions_series, color = "green", label='forecast')
legend = ax.legend(loc='upper left')
legend.get_frame().set_facecolor('w')
In [ ]: